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Abstract 

Kernel Principal Component Analysis (KPCA) is a key machine learning algorithm for extracting nonlinear fea¬ 
tures from data. In the presence of a large volume of high dimensional data collected in a distributed fashion, it 
becomes very costly to communicate all of this data to a single data center and then perform kernel PCA. Can we 
perform kernel PCA on the entire dataset in a distributed and communication efficient fashion while maintaining 
provable and strong guarantees in solution quality? 

In this paper, we give an affirmative answer to the question by developing a communication efficient algorithm 
to perform kernel PCA in the distributed setting. The algorithm is a clever combination of subspace embedding and 
adaptive sampling techniques, and we show that the algorithm can take as input an arbitrary configuration of dis¬ 
tributed datasets, and compute a set of global kernel principal components with relative error guarantees independent 
of the dimension of the feature space or the total number of data points. In particular, computing k principal compo¬ 
nents with relative error e over s workers has communication cost Oi^spkjt + sk^ je^) words, where p is the average 
number of nonzero entries in each data point. Furthermore, we experimented the algorithm with large-scale real 
world datasets and showed that the algorithm produces a high quality kernel PCA solution while using significantly 
less communication than alternative approaches. 


1 Introduction 

Kernel Principal Component Analysis (KPCA) is a key machine learning algorithm for extracting nonlinear features 
from complex datasets, such as image, text, healthcare and biological data iniEiE]. The original kernel PCA algorithm 
is designed for a batch setting, where all data points need to fit into a single machine. However, nowadays large vol¬ 
umes of data are being collected increasingly in a distributed fashion, which poses new challenges for running kernel 
PCA. For instance, a large network of distributed sensors can collect temperature readings from geographically distant 
locations; a system of distributed data centers in an Internet company can process user queries from different countries; 
a fraud detection system in a bank needs to perform credit checks on people opening accounts from different branches; 
and a network of electronic healthcare systems can store patient records from different hospitals. It is very costly in 
terms of network bandwidth and transmission delays to communicate all of the data collected in a distributed fashion 
to a single data center, and then run kernel PCA on the central node. In other words, communication now becomes 
the bottleneck to the nonlinear feature extraction pipeline. How can we leverage the aggregated computing power 
in a large distributed system? Can we perform kernel PCA on the entire dataset in a distributed and communication 
efficient fashion while maintaining provable and strong guarantees in solution quality? 

While recent work shows how to do linear PCA in a communication efficient and distributed fashion Bl , the kernel 
setting is significantly more challenging. The main problem with previous work is that it achieves communication 
proportional to the dimension of the data points, which if implemented straightforwardly in the kernel setting would 
give communication proportional to the dimension of the feature space which can be very large or even infinite. 
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Kernel PCA uses the kernel trick to avoid going to the potentially infinite dimensional kernel feature space explicitly, 
so intermediate results are often represented by a function (e.g., a weighted combination) of the feature mapping of 
some data points. Communicating such intermediate results requires communicating all the data points they depend 
on. To lower the communication, the intermediate results should only depend on a small number of data points. A 
distributed algorithm then needs to be carefully designed to meet this constraint. 

In this paper, we propose a communication efficient algorithm for distributed KPCA in a master-worker setting 
where the dataset is arbitrarily partitioned and each portion sits in one worker, and the workers can communicate only 
through the master. Our key idea is to design a communication efficient way of generating a small representative 
subset of the data, and then performing kernel PCA based on this subset. We show that the algorithm can compute 
a rank-fc subspace in the kernel feature space using just a representative subset of size 0{k/e) built in a distributed 
fashion. For polynomial kernels, it achieves a (1 + e) relative-error approximation to the best rank-fc subspace, and for 
shift-invariant kernels (such as the Gaussian kernel), it achieves (1 -|- e)-approximation with an additive error term that 
can be made arbitrarily small. In both cases, the total communication for a system of s workers is 0{spk/e + sk^/e^) 
words, where p is the average number of nonzero entries in each data point, and is always bounded by the dimension 
of the data d and independent of the dimension of the kernel feature space. This for constant e nearly matches the 
lower bound ^l{sdk) for linear PCA Hj. As far as we know, this is the first algorithm that can achieve provable 
approximation with such communication bounds. 

As a subroutine of our algorithm, we have also developed an algorithm for the distributed Column Subset Selection 
(CSS) problem, which can select a set of 0{k/e) points whose span contains (1 -|- e)-approximation, with communi¬ 
cation 0{spk/e + sk'^). This is the first algorithm that addresses the problem for kernels, and it nearly matches the 
communication lower bound Vl{spk/e) for this problem in the linear case Q. The column subset selection problem 
has various applications in big data scenarios, so this result could be of independent interest. 

Furthermore, our algorithm also leads to some other distributed kernel algorithms; the data can then be projected 
onto the subspace found and processed by downstream applications. For example, an immediate application is for 
distributed spectral clustering, that first computes KPCA to rank-fc/e and then does fc-means on the data projected on 
the subspace found by KPCA {e.g., IhJ). This can be done by combining our algorithm with any efficient distributed 
A:-means algorithms {e.g., |[7l). 

We evaluate our algorithm on datasets with millions of data points and hundreds of thousands of dimensions where 
non-distributed algorithms such as batch KPCA are impractical to run. Furthermore, comparing to other distributed 
algorithms, our algorithm requires less communication and fewer representation data points to achieve the same ap¬ 
proximation error. 

Outline Sectionj^reviews related work, and Sectionj^reviews some preliminaries. Section|^provides an overview. 
Section [^presents our distributed kernel PCA algorithm, which consists of the key building blocks: kernel subspace 
embedding, computing (generalized) leverage scores, sampling representative points, and finally computing the solu¬ 
tion in the span of the selected data points. Section|^provides empirical results. Due to space limitation, the details of 
some proofs are deferred to the full version i). 


2 Related Work 

There has been a surge of recent work on distributed machine learning, e.g., m [lain] El. In this setting, the data sets 
are typically large, and small error rate is required. This is because if only a coarse error is needed then there is no 
need to use large-scale data sets; a small subset of the data will be sufficient. Furthermore, one prefers relative error 
rates instead of additive error rates, since the latter is worse and harder to interpret without knowing the optimum. Our 
algorithm can achieve small relative error with limited communication. 

Since there exist communication efficient distributed linear PCA algorithms mini, it is tempting to adopt the 
random feature approach for distributed kernel PCA: first construct m random features and then solve PCA in the 
primal form, i.e., apply distributed linear PCA on the random features. However, the communication of this method 
is too high. One needs m = 0{d/e^) random features to preserve the kernel values up to additive error e, leading to 
a communication of 0{skm/e) = 0{skd/e^). Another drawback of using random features is that it only produces a 
solution in the space spanned by the random features, but not a solution in the feature space of the kernel. 
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The Nystrom method is another popular tool for large-scale kernel methods: sample a subset of data points uni¬ 
formly at random, and use them to constrTict an approximation of the original kernel matrix. However, it also suffers 
from high communication cost, since one needs 0(l/e^) sampled points to achieve additive e error in the Frobenius 
norm of the kernel matrix ca. A closely related method is incomplete Cholesky decomposition US, where a few 
pivots are greedily chosen to approximate the kernel matrix. It is unclear how to design a communication efficient 
distributed version since it requires as many rounds of communication as the number of pivots, which is costly. 

Leverage score sampling is a related technique for low-rank approximation El. A prior work of Boutsidis et al. 
m gives the first distributed protocol for column subset selection. El gives a distributed PCA algorithm with optimal 
communication cost, but only for linear PCA. In comparison, our work is the first communication efficient distributed 
algorithm for low rank approximation in the kernel space. 

3 Backgrounds 

For any vector v, let ||ri|| denote its Euclidean norm. For any matrix M £ let Mi- denote its i-th row and 

M.j its j-th column. Let ||M||^ denote its Frobenius norm, and ||M ||2 denote its spectral norm. Let its rank be 
r < min {n, d}, and denote its SVD as M = UYjV^ where U G S G and V G Let denote 

its best rank-fc approximation. Finally, denote its number of non-zero entries as nnz(M). 

In the distributed setting, there are s workers that are connected to a master processor. Worker i has a local data 
set A* G and the global data set A G is the concatenation of the local data (n = ^i)- 

Kernels and Random Features. For a kernel k(x, x'), let T-L denote its feature space, i.e., there exists a feature 
mapping ^(•) G H such that k{x,x') = {(j){x),(j){x'))^. Let ^(A) G TdA denote the matrix obtained by applying (jj 
on each column of A and concatenating the results. Throughout the paper, we regard any M G "H" as a matrix whose 
columns are elements in H and define matrix operations accordingly. For example, for any M G TdA and N G "H™, 
let B = N G where Bij = {M-i, and let ||M||^ = tr M^. When there is no ambiguity, we 

omit the subscript T-L. 

The random feature approach is a recent technique to scale up kernel methods. Many kernels can be approximated 

m Sti where w^’s are randomly sampled. These include Gaussian RBF kernels and other shift- 

invariant kernels, inner product kernels, etc l lfThlflTll L For example, Gaussian RBF kernels, k{x, y) = exp(—||a: — 
2 /|P/ 2 (t^), can be approximated by A ^ Zuji,bi{x)Zuji,bi{y) where z^j^bix) = '/^cos{uj^x -f b) and coi is from a 
Gaussian distribution with density proportional to exp(—cr^ /2) and bi is uniform over [0,27r]. 

In this paper, we provide guarantees for shift-invariant kernels using Fourier random features (the extension to 
other kernels/random features is straightforward). We assume the kernel satisfies some regularization conditions: it is 
defined over bounded compact domain in R'^, with k( 0) < 1 and bounded V^fc(O) ifThl . Such conditions are standard 
in practice, and thus we assume them throughout the paper. 

Kernel PCA. An element u £ "H is an eigenfunction of (j){A)(j){A)^ with the corresponding eigenvalue A if ||u|| = 
1 and (j){A)cj){A)^ u = Xu. Given eigenfunctions {ui} of 0(A)(/)(A)^ and eigenvectors {rii} of (/>(A)^(/)(A), (j){A) 
has the singular decomposition UTikV^ + U±'S±V2 , where U, V are the lists of top k eigenfunctions/vectors, Efc is 
a diagonal matrix with the corresponding singular values, U±, V± are the lists of the rest of the eigenfunctions/vectors, 
and Ex is a diagonal matrix with the rest of the singular values. Kernel PCA aims to identify the top k subspace 
U, since the best rank-A: approximation [<j){A)]k = UYjkV^ = UU^(j>{A). Typically, the goal is to find a good 
approximation to this subspace. Formally, 

Definition 1. A subspace L G is a rank-k (1 + e, A)-approximation for kernel PCA on A if LAL = Ik and 

\\f{A) - < (1 + e) MA) - mA)]kf + A. 

Note that kernel PCA immediately leads to solutions for some other nonlinear component analysis such as kernel 
CCA, and also provides the needed subroutine for tasks like spectral clustering. 

Snbspace Embeddings. Subspace embeddings are a useful technique that can improve the computational and 
space costs by embedding data points into lower dimension while preserving interesting properties. Subspace em¬ 
beddings have been extensively studied in recent years ifTSl [T9l l20l 1^ l22l . The recent fast sparse subspace embed¬ 
dings II 22 I and its optimizations f23[ l24ll are particularly suitable for large-scale sparse datasets, since their running 
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time is linear in the number of non-zero entries in the data matrix. They also preserve the sparsity of the input data. 
Formally, 

Definition 2. An e-subspace embedding of M € is a matrix S S such that for any x, 

\\SMx\\ = (lie) \\Mx\\ . 

Subspace embeddings can also be done on the right hand side, i.e., S G and ||x^MS'|| = (1 i e) ||x^M||. 

Mx is in the column space of M and SMx is its embedding, so the definition means that the norm of any vector 
in the column space of M is approximately preserved. This then provides a way to do dimensional reduction for 
problems depending on inner products of vectors. Our algorithm repeatedly makes use of subspace embeddings. In 
particular, the embedding we use is the concatenation of the following known sketching matrices: CountSketch 
and i.i.d. Gaussians (or the concatenation of COUNTSketch, fast Hadamard and i.i.d. Gaussians). The details can be 
found in lfT4ll : we only need the following fact. 

Lemma 1. For M G there exist sketching matrices S G with t = Oinje^') that are e-subspace embed¬ 

dings. Furthermore, SM can be successfully computed in time 0(nnz(M)) with probability at least 1 — 5. 

The work of 1251 shows that a fast computational approach, TensorSketch, is indeed a subspace embedding for 
the polynomial kernel. However, there are no previously known subspace embeddings for other kernels. We develop 
efficient and provable embeddings for a large family of kernels including Gaussian kernel and other shift invariant 
kernels. These embeddings will be a key tool used by our algorithm. 

4 Overview 

In view of the limitations of the related work, we instead take a different approach, which first selects a small subset 
of points whose span contains an approximation with relative error rate e, and then find a low rank approximation 
in their span. It is important to keep the size of the subset small and also guarantee that their span contains a good 
approximation (this is also called kernel column subset selection). A well known technique is to sample according to 
the statistical leverage scores. 

Challenges. However, this immediately raises the following technical challenges. 

I. Computing the statistical leverage scores is prohibitively expensive. Naively computing them requires com¬ 
municating all data points. There exist non-trivial fast algorithms ||26l, but they are designed for the non-distributed 
setting. Using them in the distributed setting leads to communication linear in the number of data points, or linear in 
the number of random features if one uses random features and computes the leverage scores for them. 

Our key idea is that it is sufficient to compute the (generalized) leverage scores of the data points, i.e., the leverage 
scores of another matrix whose row space approximates that of the original data matrix. So the problem is reduced to 
designing kernel subspace embeddings that can approximate the row space of the data. 

II. Even given the embedded data, it is unclear how to compute its leverage scores in a communication efficient 
way. Although the dimension of the embedded data is small, existing algorithms will lead to communication linear in 
the number of data points, which is impractical. 

III. Simply sampling according to the generalized leverage scores does not give the desired results: a good ap¬ 
proximation can only be obtained using a much larger rank, specifically, 0{k/e). 

IV. After selecting the small subset of points, we need to design a distributed algorithm to compute a good low 
rank approximation in their span. 

Algorithm. We have designed a distributed kernel PCA algorithm that computes an approximated solution with 
relative erTor rate e using low communication. The algorithm operates in following key steps, each of which addresses 
one of the challenges mentioned above (See Figure [^l: 

I. Kernel Subspace Embeddings. To approximate the subspace of the original data matrix, we propose subspace 
embeddings for a large family of kernels. For polynomial kernels we improve the prior work by reducing the em¬ 
bedding dimension and thus lowering the communication. Furthermore, we propose new subspace embeddings for 
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(a) Compress data and compute leverage scores 


• •• ••• 





(b) Leverage score sampling 


• •• 


(d) Project data and compute KPCA 


Figure 1: Algorithm overview. The black machine at the center is the master and the gray machines are the workers. Each worker 
stores its portion of the dataset, and the algorithm computes the top k principle components on the whole dataset. The arrows 
between the machines denote the direction of communications. In each round, the communication always starts from the workers 
to the master (lighter arrows) and then from the master to the workers (darker arrows), (a) Each worker compresses its data by 
using (kernel) subspace embeddings and sends it to the master. The master aggregates the data and computes intermediate results 
for leverage scores and sends back to the workers, (b) Each worker computes the leverage scores, samples data points (denoted 
by circles) and then sends them to the master. The master distributes back the union of the sampled data points, (c) Each worker 
conducts adaptive sampling and sends newly sampled points to the master. The master distributes back the union of all sampled 
points, (d) Each worker projects its data onto the subspace spanned by the sampled data points and sends the compressed projections 
to the master. The master computes coefficients for the top k principle components by running SVD, and then sends them back to 
the workers, (best viewed in color) 


kernels with random feature expansions, allowing PCA for these kernels to be computed in a communication efficient 
manner. See Section lSTI for the details. 

II. Distributed Leverage Scores. To compute the leverage scores, sampling with constant approximations is suffi¬ 
cient. We can thus drastically reduce the number of data points: first do another (non-kernel) subspace embeddings on 
the embedded data, and then send the result to the master for computing the scores. See Figure[TJa) for an illustration 
and Section l572l for the details. 

III. Sampling Representative Points. We take a two-step approach as leverage scores alone is not good enough : 
first sample according to generalized leverage scores, and then sample additional points according to their distances 
to the span of the points sampled in the first step. The first step gains some coarse information about the data, and the 
second step use it to get the desired samples. The two steps are illustrated in Figure[TJb) and[TJc), respectively, while 
the details are in Section l53] 

IV. Computing an Approximation. After projecting the data to the span of the representative points, we sketch 
the projections by (non-kernel) subspace embeddings. We then send the compressed projections to the master and 
compute the solution there. See Figure[2d) for an illustration and Section 5.4 for the details. 

Main Theoretical Results. Given as input the local datasets, the rank k and error parameters e, A, our algorithm 
outputs a (1 -f e, A)-approximation to the optimum with large probability. Formally, 


Theorem I. Algorithm^^mduces a subspace Lfor kernel PCA on A that with probability > 0.99 satisfies: 

1. L is a rank-k (1 -f e, 0)-approximation when applied to polynomial kernels. 

2. L is a rank-k (1 -f e, /A)-approximation when applied to shift-invariant kernels with regularization. 
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The total communication is words, where p is the average number of nonzero entries in one data point. 

The constant success probability can be boosted up to any high probability 1 — (5 by repetition, which adds only an 
extra 0(log y) term to communication and computation. 

The output subspace L is represented by 0{k/e) sampled points Y from A (i.e., L = (l){Y)C for some coefficient 
matrix C), so L can be easily communicated and the projection of any point on L can be easily computed by the kernel 
trick. The communication has linear dependence on the dimension and the number of workers, and has no dependence 
on the number of data points, which is crucial for big data scenarios. Moreover, it does not depend on A (but the 
computation does), so the additive error can be made arbitrarily small with more computation. 

The theorem also holds for other properly regularized kernels with random feature expansions (see Hi HU for 
more such kernels); the extension of our proof is straightforward. 

We also make the following contributions; (/) Subspace embedding techniques for many kernels, (ii) Distributed 
algorithm for computing generalized leverage scores with low communication. (Hi) Distributed algorithm for kernel 
column subset selection. 


5 Distributed Kernel Principal Component Analysis 


Our algorithm first computes the (generalized) leverage scores that measure the non-uniform structure, then samples 
the desired subset of points whose span contains a good approximated solution, and finally finds such a solution in the 
span. 

Leverage scores are critical for importance sampling in many fast randomized algorithms. The leverage scores are 
defined as follows. 

Definition 3. For E G with SVD E = UYV^, the leverage score for its j-th column is £j = || Vj: ||^ . 


Their importance is reflected in the following fact: suppose E has rank at most k, and suppose P is a subset of 
Q( fciogfc ) obtained by repeatedly sampled from the columns of E according to their leverage scores, then the 

span of P contains an (1+e, 0)-approximation subspace for E with probability > 0.99 (see, e.g., IZTl '). Here, sampling 
one column according to the leverage scores £j means to define sampling probabilities pj such that pj > , 3 „ for 

t • 

all j, and then pick one column where the j-th column is picked with probability pp Note that setting pj = yPjr is 
clearly sufficient, but a constant variance of pj is allowed at the expense of an extra constant factor in the sample size. 


This means that it is sufficient to compute constant approximations ij for £j, and then sample according to pj = . 

y . j ^3 

However, even computing constant approximations of the leverage scores are non-trivial: naive approaches require 
SVD, which is expensive. Actually, SVD is more expensive than the task of PCA itself Even ignoring computation 
cost, naive SVD is prohibitive in the distributed setting due to its high communication cost. Fortunately, it turns out 
that the leverage scores are an over kill for our purpose; it suffices to compute the generalized leverage scores, i.e., the 
leverage scores of a proxy matrix. 


Definition 4. If E has rank q and can approximate the row space of M up to (1 + e, A), i.e., there exists X with 


\\XE - M||^ < (1 + e) \\M - [MUp + A, 


then the leverage scores of E are called the generalized leverage scores of M with respect to rank q. 

This generalizes the definition in by allowing the rank of E to be larger than k and allowing additive error A, 
which are important for our application. The generalized leverage scores can act as the leverage scores for our purpose 
in the following sense. 

Lemma 2. Let P be columns sampled from M according to their generalized leverage scores w.r.t. rank q. 

Then with probability > 0.99, the span of P has a rank-s (1 + 2e, 2A)-approximation subspace for M. 

Proof It follows from combining Theorem 5 in ETII and the definition of the generalized leverage scores. □ 
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Computing the generalized scores with respect to rank q could be much more efficient, since the intrinsic dimension 
now becomes q, which can be much smaller than the ambient dimension (the number of points or the dimension of the 
feature space). However, as noted in the overview, there are still a few technical challenges. 

• Efficiently find a smaller matrix E that can approximate the row space of the original data. 

• Compute the leverage scores of i? in a communication efficient way. 

• The approximation solution in the span of P has the same rank as E, which is 0(k/e) when we use kernel 
subspace embedding to obtain E. This is not satisfying since our final goal is to compute a rank-fc solution. 

• Find a good approximation in the span of (j>(Y) with low communication. 

Our final algorithm consists of four key steps, each of which addresses one of the above challenges. They are elaborated 
in the following four subsections respectively, and the final subsection presents the overall algorithm. 

5.1 Kernel Subspace Embeddings 

Recall that a subspace embedding S for a matrix M is such that ||S'Ma;|| « ||Ma;||, i.e., the norm of any vector in the 
column space of M is approximately preserved. Subspace embeddings can also be generalized for the feature mapping 
of kernels, simply by setting M = <j){A), S a linear mapping from "H and using the corresponding inner product. 

If the data after the kernel subspace embedding is sufficient for solving the problem under consideration, then only 
S(I){A) in much lower dimension is needed. This is especially interesting for distributed kernel methods, since directly 
using the feature mapping or the kernel trick in this setting will lead to high communication cost, while the data after 
embedding can be much smaller and lead to much lower communication cost. 

A sufficient condition for solving many problems (in particular, kernel PCA) is to preserve the low rank structure 
of the data. More precisely, the row space of S(j){A) is a good approximation to that of (j){A), where the error is 
comparably to the best rank k approximation error. Then S(j){A) can be used to compute the generalized leverage 
scores for 4>{A), which can then be utilized to compute kernel PCA as mentioned above. 

More precisely, we would like S(j){A) to approximate the row space of (f>{A) up to (1 + e, A), as required in the 
definition of the generalized leverage scores. We give such embeddings a particular name. 

Definition 5. S is called a (1 + e, A)-good subspace embedding for (j){A) C "H", if there exists X such that 

\\X{Sf{A)) - f{A)f < (1 + e) 110(A) - [0(A)],f + A. 

We now identify the sufficient conditions for (1 + e, A)-good subspace embeddings, which can then be used in 
constructing such embeddings for various kernels. 

Lemma 3. S is a (1 + e, A)-good subspace embedding for 0(A) € Tf" if it satisfies the following. 

PI (Subspace Embedding): For any orthonormal V G (i.e., V is the identity), for all x G 

||^Ex|| = (l±c)||yx|| 

where c is a sufficiently small constant. 

P2 (Approximate Product): for any M G H”, N G 

||(S'A)^(S'M) - N^M\\l < ^ ||Af \\Mf + A. 

Polynomial Kernels. For polynomial kernels, there exists an efficient algorithm Tens ORS KETCH to compute the 
embedding Il25l . However, the embedding dimension has a quadratic dependence on the rank k, which will increase 
the communication. Fortunately, subspace embedding can be concatenated, so we can further apply another known 
subspace embedding such as one of those in Lemma [T] which, though not fast for feature mapping, is fast for the 
already embedded data and has lower dimension. In this way, we can enjoy the benefits of both approaches. 

The guarantee of TensorSketch in ||25]| and the property of the subspace embeddings in Lemma can be 
combined to verify PI and P2. So we have 
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Algorithm 1 Distributed Leverage Scores: {£*} = disLS({i?*}*j^ , k) 


1: Each worker i: do |-subspace embedding E’'T^ € 


with p = 0{t)-, send E’-T^ to Master. 


Master: QR-factorize [E^T^, ..., E^T^] = UZ; send Z to all workers. 


3: Each worker i: compute = 


(iz^r^E^) 


Lemma 4. For polynomial kernels k{x, y) = ((x, y)Y, there exists an (1 -|- e, Q)-good subspace embedding matrix 
S ^ K* with t = 0{k/e). 

Kernels with Random Feature Expansions. Polynomial kernels have finite dimensional feature mappings, for 
which the sketching seems natural. It turns out that it is possible to extend subspace embeddings to kernels with 
infinite dimensional feature mappings. More precisely, we propose subspace embeddings for kernels with random 
feature expansions, i.e., k{x, y) = for some function ^(•). Therefore, one can approximate the kernel 

by using m features z^{x) on randomly sampled w. Such random feature expansion can be exploited for subspace 
embeddings: view the expansion as the “new” data points and apply a sketching matrix on top of it. Compared to 
polynomial kernels, the finite random feature expansion leads to an additional additive error term. Our analysis shows 
that bounding the additive error term only requires sufficiently large sampled size m, which affects the computation 
but does not affect the final embedding dimension and thus the communication. 

In summary, the embedding is Scj){x) = TR{(j){x)), where i?((/)(x)) G is m random features for x and 
T G is an embedding as in Lemma[T] The properties PI and P2 can be verified by combining Lemma[T]and the 
guarantees of random features. 

Lemma 5. For a continuous shift-invariant kernels k{x, y) = k(x — y) with regularization, there exists an (1 + e, A)- 
good subspace embedding 5" : "H i—>■ M* with t = 0{k/e). 


5.2 Computing Leverage Scores 


Given the matrix E obtained from kernel subspace embedding, we would like to compute the leverage scores of 
E. Eirst note that this cannot be done simply in a local manner: the leverage score of a column in E* is different from 
the leverage score of the same column in E. Eurthermore, though data in E have low dimension, communicating all 
points in E to the master is still impractical, since it leads to communication linear in the total number of points. 

Eortunately, we only need to compute constant approximations of the scores, which allows us to use subspace 
embedding on E to greatly reduce the number of data points. In particular, we apply a |-subspace embedding T* 
(e.g., one of those in Lemma[^ on each local data set E'^, and then send them to the master. Let ET denote all the 
embedded data, and do QR factorization {ET)^ = UZ. Now, the rows of = {Z^) ^ ET are a set of basis for 


ET. Then, think of C/^Tf = (Z^) 
{Z^Y^ E. 


as the basis for E, so it suffices to compute the norms of the columns in 


The details are described in Algorithm and Eigure [TJa) shows an illustration. The algorithm is guaranteed to 
output constant approximations of the leverage scores of E. 


Lemma 6. Let be the true leverage scores 


of E. Then Algorithm^outputs 


(1 ± 1/2)£J. 


Proof. The algorithm can be viewed as applying an embedding T = diag (T^,..., T®) on E to approximate the 
scores while saving the costs. Each T* is an f-subspace embedding matrix, then for any x. 


\\x^ETf = \\[x^ E^T^ ,x^ E^T^,... ,x^ E^^T^f = \\x^ E^Tf = ^(1± 1/4)^ 

2 — 1 2—1 

= ( 1 ± 1 / 4)2 \\x^Ef . 

So T is also |-subspace embedding. Such a scheme of using embedding for approximating the scores has been 
analyzed (Lemma 6 in ll2^ ). and the lemma follows. □ 
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Algorithm 2 Sampling Representative Points: Y = RepSample({A*}*j^ , {P'j}-, k, e) 

1: Workers: sample 0{k log k) points according to {ij}', send to Master; 

2: Master: send all the sampled points P to the workers; 

3: Workers: sample 0{k/e) points Y according to the square distances to P in the feature space; send to Master; 
4: Master: send Y = F U P to all the workers. 


Algorithm 3 Computing an Approximation: L = disLR({A®}^ Y, k, e, A) 

1: Each worker i: compute the basis Q for (/<(F) and If® = do an e-subspace embedding n®T® G 

with w = 0(|F|/e^), and send n®T® to Master; 

2: Master: concatenate IfP = , If^T®] and send the top k singular vectors W of IIT to the workers. 

3: Each worker i: set L = QW. 


We note that though a constant approximation is sufficient for our purpose, but the algorithm can output = 
(1 ± e)^® by doing an |-subspace embedding (instead of j), which can be useful for other applications. 

5.3 Sampling Representative Points 

Sampling directly to the leverage scores can produce a set of points P such that the span of (j){P) contains a 
(1 + e, A)-approximation to 4>{A). However, the rank of that approximation can be as high as 0{k/e), since its rank 
is the same as that of the embedded data (see Lemma |^, which will be 0{k/e) to achieve e error. To get a rank-A: 
approximation and also enjoy the advantage of leverage scores, we propose to combine leverage score sampling and 
the adaptive sampling algorithm in Il28ll29]l . 

The details are presented in Algorithm]^ We first sample a set P of 0{k\ogk) points according to the leverage 
scores, so that the span of (j){P) contains a (2, A)-approximation. Then we use the adaptive sampling method: sample 
0{k/e) points according to the square distances from the points to their projections on P and then add them to P to 
get the desire set Y of representative points. Eigure[2b) and[2c) demonstrate the two steps of the algorithm. 
Adaptive sampling has the following guarantee: 

Lemma 7. Suppose there is a (2, A)-approximation for in the span of (j){P). Then with probability > 0.99, the 

span of(j){Y) has a rank-k (1 -f e, A)-approximation. 

Therefore, we solves the column subset selection problem for kernels in the distributed setting, with Oik log k -f 
k/e) selected columns and with a communication of only 0{spk/e -f sfc^). This also provides the foundation for 
kernel PCA task. 

5.4 Computing an Approximation 

To compute a good approximation in the span of (j){Y), the naive approach is to project the data to the span and 
compute SVD there. However, the communication will be linear in the number of data points. Subspace embedding 
can be used to sketch the projected data, so that the number of data points is greatly reduced. 

Algorithm|^describes the details and Eigure[TJd) shows an illustration. To compute the best rank-fc approximation 
for the projected data H, we do a subspace embedding on the right hand side, i.e., compute HT = [H^T^,..., H^T®]. 
Then the algorithm computes the best rank-fc approximation W for HT, which is then a good approximation for H and 
thus (j){A). It then returns L, the representation of W in the coordinate system of fi^A). The output L is guaranteed to 
be a good approximation. 

Lemma 8. If there is a rank-k (1 -f e, A)-approximation subspace in the span of (j){Y), then 
\\LL^f{A) - 0(A)f < (1 + MA) - [f[A)]^\f + (1 + e)A. 
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Algorithm 4 Distributed Kernel PCA: L = disKPCA({ , k, e, A) 

1: Each worker i: do a (1/4, A)-good subspace embedding E’* = S'((/(A®)) € = 0(fc); 

2: Compute the leverage scores: {£*} = disLS({E*}*_^, k); 

3: Sample points: Y = RepSample({A®}*_^, {1]}, k, e); 

4: Output L = disLR({ , Y, fc, e, A). 


Proof Sketch. For our choice of w, T® is an e-subspace embedding matrix for If®. Then their concatenation B is an 
e-subspace embedding for If, the concatenation of If®. Then we can apply the idea implicit in im. 

By Pythagorean Theorem, the error can be factorized into 

WLL^f^A) - Q^(j){A)\\^ + \\cj,{A) - QQ^f{A)f . 

■' -^-- '-V-' 

T1 T2 

Since LL^ = QWW^Q^, 

T1 = \\WW^Q^f{A) - Q^4>{A)f . 

Note that If = <!){A), and W is the best rank-A: subspace for its embedding flT. By property of T (Theorem 7 

in mi it is also a good approximation for If. So 

Tl « \\[Q^f{A)]k - = \\Q[Q^f{A)]k - ■ 

Combining this with T2, and applying Pythagorean Theorem again, we know that the error is roughly 

\\Q[Q^f{A)], - f{A)\\\ 

Now, by assumption, there is arank-fc A)-approximation subspace X in the span of (j>{Y). Since [Q^4>{A)]i. 

is the best rank-A: approximation to (j){A), 

\\Q[Q^cl>{A)]k-f{A)\\^ 

= \\Q[Q^f{A)]k - QQ^fiA)^ + WQQ^fiA) - 
< ||X - QQ^f{A)f + \\QQ^(I>{A) - 4>{A)f 
= \\X-f{A)\\\ 

The lemma then follows. 

5.5 Overall Algorithm 

Now, putting things together, we obtain our final algorithm for distributed kernel PCA (Algorithm]^. Our main result, 
Theorem[T] follows by combining all the lemmas in the previous subsections (with properly adjusted e and A). 

6 Experiments 

6.1 Datasets 

We use ten datasets to evaluate our algorithm. They contain both sparse and dense data and come from a variety of 
different domains, such as text, images, high energy physics and biology. We use two smaller ones to benchmark 
against the single-machine batch KPCA algorithm while the rest are large-scale datasets with up to tens of millions of 
data points and hundreds of thousands dimensions. Refer to Table[^for detailed specifications. 

Each dataset is partitioned on different workers according to the power law distribution with exponent 2 to simulate 
the distribution of the data over large networks ll32l . Depending on the size of the dataset, the number of workers used 
ranges from 5 to 200 (see Table[T]for details). 
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Table 1: Dataset specification: d is the original feature dimension, n is the number of data points, and s is the total 
number of workers storing the dataset distributedly. Among them, bow and 20news are sparse datasets. All datasets 
except mnistSm are taken from UCI repository OOl and ll3Tl . 


Dataset 

d 

n 

s 

bow 

100,000 

8,000,000 

200 

higgs 

28 

11,000,000 

200 

mnistSm 

784 

8,000,000 

100 

susy 

18 

5,000,000 

100 

yearpredmsd 

90 

463,715 

10 

ctslice 

384 

53,500 

10 

20news 

61,118 

11,269 

5 

protein 

9 

41,157 

5 

har 

561 

10,299 

5 

insurance 

85 

9,822 

5 


6.2 Experiment Settings 

Since our key contribution is sampling a small set of data points intelligently, the natural alternative is uniformly 
sampling. We compare with two variants of uniform sampling algorithms: 1) uniformly sampling representative 
points and use Algorithm[^to get KPCA solution (denoted as uniform+disLR); 2) uniformly sampling data points and 
apply batch KPCA (denoted as uniform+batch KPCA). 

For both algorithms, we compare the tradeoff of low rank approximation error and communication cost. Particu¬ 
larly, we compare the communication needed to achieve the same error. Each method is run 5 times and the mean and 
the standard deviation are reported. 

For polynomial kernel, the degree is g = 4 and for Gaussian RBF kernel, the kernel bandwidth a is set to 0.2 of 
the median pairwise distance among a subset of 20000 randomly chosen data points (a.k.a, the “median trick”). For 
Gaussian random feature expansion, we use 2000 random features. 

In all experiments, we set the number of principle components k = 10, which is the same number for fc-means. 
The algorithm specific parameters are set as follows: 1) The subspace embedding dimension for the feature expansion 
t is 50; 2) The subspace embedding dimension for the data points p is 250; 3) We vary the number of adaptively 
sampled points jF| from 50 to 400 to simulate different communication cost; 4) The subspace embedding dimension 
w is set to equal \Y\. 

6.3 Comparison with Batch Algorithm 

We compare to the “ground-truth” solutions produced by batch KPCA on two small datasets where it is feasible. 
The experiment results for the polynomial kernel and the Gaussian RBF kernel are presented in Figures and 
respectively. 

In both cases, the low-rank approximation error of disKPCA decreases as more communication (more represented 
points) is allowed. It can nearly match the optimum low-rank approximation error with much fewer data points. In 
addition, it is much faster: we gain a speed up of 10x by using five workers. 

6.4 Communication Efficiency 

In this set of experiments, we focus on comparing the tradeoff between communication cost and approximation accu¬ 
racy on large-scale datasets. The alternative, uniform + batch KPCA, is stopped short in many experiments due to its 
excessive computation cost for larger number of sampled data points. 

Figure 1^ demonstrates the performance on polynomial kernels on four large datasets. On all four datasets, our 
algorithm outperforms the alternatives by significant margins. Especially on bow, which is a sparse dataset, the usage 
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Figure 2: KPCA for polynomial kernels on small datasets; low-rank approximation error and runtime 
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Figure 3: KPCA for Gaussian kernels on small datasets: low-rank approximation error and runtime 
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Figure 5: KPCA for Gaussian kernels on larger datasets 
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Figure 6; KPCA results for arc-cos kernels 




Figure 7: KPCA scaling results 


of kernel embeddings takes advantage of the sparsity structure and leads to much smaller error. On other datasets, 
uniform + disLR cannot match the error achieved by our algorithm even when using much more communication. 

Figure|^shows the performance on Gaussian kernels. On mnistSm, the error for uniform + batch KPCA is so large 
(almost twice of the errors in the figure) that it is not shown. On other datasets, disKPCA achieves significant smaller 
error. For example, on higgs dataset, to achieve the same approximation error, uniform + disLR requires more than 5 
times communication. Since it does not have the communication of computing leverage scores, this means that it needs 
to sample much more points to get similar performance. Therefore, our algorithm is very efficient in communication 
cost. 

Besides polynomial and Gaussian kernels, we have also conducted experiments using arc-cos kernel The 
arc-cosine kernels have random feature bases similar to the Rectified Linear Units (ReLU) used in deep learning. In 
the experiments, we use degree n = 2 and Figure shows the results. Our algorithm consistently achieves better 
tradeoff between communication and approximation and the benefit is especially more pronounced on sparser dataset 
such as 20news. 


6.5 Scaling Results 

In Figure]^ we present the scaling results for disKPCA. In these experiments, we vary the number of workers and 
record the corresponding computation time (communication time excluded). On both datasets, the runtime decreases 
as we use more workers, and it eventually plateaus. Our algorithm gains about 2 x speedup by using 4 x more workers. 
Note that our algorithm is designed to strike a good balance between communication and approximation. Even though 
computation complexity is not our first priority, the experiments show disKPCA still enjoys favorable scaling property. 

6.6 Distributed Spectral Clustering 

We have also experimented using A:-means clustering as a downstream application. Such combination is known as 
a form of spectral clustering. We project the data onto the top k principle components and then apply a distributed 
A:-means clustering algorithm ||7]. The evaluation criterion is the fc-means objective, i.e., average distances to the 
corresponding centers, in the feature space. 

Figure [8(a)| presents results for polynomial kernels on the 20news and susy datasets and Figure [8(b)] presents results 
for Gaussian kernels on ctslice and yearpredmsd datasets. Our disKPCA algorithm compares favorably with the other 
methods and achieves a better tradeoff of communication and error. This means that although the other methods 
require similar communication, they need to sample more data points to achieve the same loss, demonstrating the 
effectiveness of our algorithm. 


7 Conclusion 

This paper proposes a communication efficient distributed algorithm for kernel Principal Component Analysis with 
theoretical guarantees. It computes a relative-error approximation compared to the best rank-A: subspace, using com¬ 
munication that nearly matches that of the state-of-the-art algorithms for distributed linear PCA. This is the first 
distributed algorithm that can achieve such provable approximation and communication bounds. The experimental 
results show that it can achieve better performance than the baseline using the same communication budget. 
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Figure 8; KPCA + fc-means clustering 
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A Remark on Using Kernel Tricks in the Algorithms 

In computing the final solution from V in Algorithmic we need to compute the projection of 4>{A) onto 4>0^)- This 
can be done by using kernel trick and implicit Gram-Schmidt. Note that 11® = where Q is the basis for (j){P). 

Suppose (j>{Y) has QR-factorization (piY) = QR. Then Q = (j>{Y)R~^ and (j){A) = where 

(j){Y)^ (1){A) is just the kernel value between points ink" and points in A. For i?, we have = {R~^)^ (f>{Y)^ (j){Y)R~^ 
I, so R^R = (/)(F)^(/)(y) and thus R can be computed by factorizing the kernel matrix on Y. 

Similarly, to compute the distance for adaptive sampling in Algorithmic we first need to compute the projection of 
4>{A) onto (/'(F’), which can be done in the same way. Then the square distance from the original data to the projection 
can be computed by subtracting the square norm of the projection from the square norm of the original data. 


B Additional Proofs 

The section provide the missing proofs in the main text. Some proofs mainly follow known arguments in the literature, 
slightly generalized to handle the additive error term for our purpose; we include them here for completeness. 


B.l Properties of Subspace Embeddings for Kernels 

Lemma|C S is a (1 + e, A)-good subspace embedding for ^{A) € RA if it satisfies the following. 
PI (Subspace Embedding): For any orthonormal V G fi.e., V is the identity), for all x G 

IISVxll = (l±c)IIVxll 


where c is a sufficiently small constant. 

P2 (Approximate Product): for any M G R’^, N G R^, 

||(S'Ar)^(S'M) - < I ||Aff \\Mf + A. 

Proof. Let := [(/)(A)]fe and f := fiA). Suppose fk has SVD UYV^. Define 

X := argmin||5'(^fe)X - S'((/>)||^ 


and also note that I = argmin^^ — fWp since fik is defined to be the best rank-A: approximation of f. 

First, consider bounding := fkiX — I) = {fkX — 4>k)- We have 

/3 = S{IJ)^S{U)I3 + {S{iJ)^S{iJ) - I)fi. 

Ti T2 


For the first term, since S' is a linear mapping, 

Ti = S{U)^S{U)fi = SiU)^S{UU^cj>k){X -I) = S{fk){X - I). 

Also note that by definition of X, we have S{(l)k)^— S(0)) = 0. Since U = fkW for some W, 
SiU)^iSifik)X - S{f)) = W^S(^fkViSifk)X - Sif)) = 0. So 

Ti = S{U)^S{fk){X - /) + SiU)^{S{cl>) - S{fk)X) = S{U)^{S{f) - Sifk)) 


which leads to 


iTi 


Af< k 


U 


H 


Wf-fkWli + A = 


For the second term, by the subspace embedding property, we have 


|T2||^< 


S{U)^S{U)-I 


eWf - fkWli + A. 

[mi¬ 


le 














Therefore, we have 


If ^ T-2 U-M 

i Cn 


A 

1 _ ^2- 
1 Cq 


Since I = argmin;^ i4>k ~ 0) = 0. Then by Pythagorean Theorem, 


Since 


(j)kX — (j)k 


4>kX - 4> 


, we arrive at 




= Uk - ^W-H 


(pkX — (j)k 


H 


H 


4>kX - (j) 


H 


< (1 + 0(e)) \\(j)k — ' 




0(A). 


Note that X = {S{(j)k)y S[(j)) where (S'(())fc))^' is the pseudoinverse of S{4>k)- Then (j)kX = 4>k{S{4'k))'' E, and 
W := 4>k{S{(j)k)y satisfies the statement. □ 


B.2 Existence of Subspace Embeddings for Kernels 

Lemmaj^ For polynomial kernels k{x, y) = ((x, y)Y, there exists an (1 + e, 0)-good subspace embedding matrix 
S ^ K* with t = 0{k/e). 

Proof. First use TensorSketch ll25i to bring the dimension down to 0{3‘^k^ + k/e). Then, we can use an i.i.d. 
Gaussian matrix, which reduces it to f = 0{k/e)', or we can first use fast Hadamard transformation to bring it down 
to 0(fcpolylog(/c)/e), then multiply again by i.i.d Gaussians to bring down to 0{k/e). PI follows immediately from 
the definition, so we only need to check the matrix product. Let S = flT where T is the TensorSketch matrix and 
n is an i.i.d. Gaussian matrix. Since they are both subspace embedding matrices, then for any M and N, 


M' S' SN-M'T'TN\\ < W- ||Mr||^ ||AT|| 


IF ’ 


M'T'TN - M' N\\ < WpllMILIlAll 


■ 


By the subspace embedding property, ||Mr||^ = (1 ± eo) || AT|j^ = (1 ± eo) for some small 

constant cq. Combining all these bounds and choosing proper e, we know that S = SIT satisfies P2. □ 


Lemma|^ For a continuous shift-invariant kernels k{x, y) = k{x — y) with regularization, there exists an (1 + e, A)- 
good subspace embedding S' : "H i—>■ IR* with t = 0{k/e). 

Proof Let R{(j>{x)) be the random feature expansion matrix with m random features. Let T G be a subspace 

embedding matrix. We define S(f/>(x)) := TR{(f){x)). Note that S(-) is a linear mapping from H to K*. 

For the random feature expansion, we have 

Claim 1 (Claim 1 in Iflhl ). Let k be a continuous shift-invariant positive-definite function k{x, y) = k{x — y) defined 
on a compact set ofM.^ of diameter D, with k(0) = 1 and such that W^k{0) exists. Let denote the second moment 
of the Fourier transform ofk. Then \k{x, y) — R((j){x))^ R{(f){jj))\ < eo with probability < 1 — i5 when 



For the subspace embedding matrix, we have 


Claim 2 (Lemma 32 in 1221). For A and B matrices with n rows, and given e > 0, there is t = 0(e^), so that for 
a t X n generalized sparse embedding matrix S, or t x n fast JL matrix, or tlog{nd) x n subsampled randomized 
Hadamard matrix, or leverage-score sketching matrix for A under the condition that A has orthonormal columns, 

Pr [||(SA)TSS - A^B\\^ < e ||A||^ ||B||^] >1-5 

for any fixed constant 5 > 0. 
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Then we have 


\\S{cl>{A)yS{cp{B)) - ^{A)^cl>{B)\\l < ||5(<^(A))T5(</.(i?)) - Ri<l>{A)yR{^{B))\\l 

+ \\Ri^iA)f R{^{B)) - ^{A)^^iB)\\l. 

For the first term, we have 

||5(0(^))T^(</,(B)) - R{cj,{A)yR{^iB))\\l <e^ \\RWA))\\l \\RWB))\\l 

<e^mA)\\lmB)\\l + 0{e^abeo). 

where a is the number of columns in A and b is the number of columns in B. Similarly the second term is bounded by 
O(aeo + beo + abcQ). So the matrix product approximation is satisfied with 

\\S{cl,{A)yS{cl,{B)) - cj>{Aycl,{B)\\l < ||</>(A)|1^ ||</>(S)||« + O{e^abeo) + O{aeo + beo + abel). 

Now consider the subspace embedding condition. Let V be an orthonormal basis spanning the subspace. Then the 
condition is equivalent to saying S{V)^S{V) have bounded eigenvalues in [1 ± 0{e)]. Note that 

||5(y)^^(l/) - < e^ ll^ll^ + O{e^eeo) + O{keo + fceo + 

where ||y||^ = fc^. 

Then choosing f = 0{k/e) and m = max |o((ia^/A^), 0((i6^/A^), 0((e^a6/A:^A)^d), O(dfc^/co)| satisfies 
the two conditions. □ 


B.3 Adpative Sampling 

Lemma|^ Suppose there is a (2, -approximation for (j){A) in the span of 4>{P). Then with probability > 0.99, the 
span of (j){Y) has a rank-k (1 + e, A)-approximation. 

Proof Let Z denote the best rank-A: approximation for ^(A) in the span of 4>{Y), and let W denote the projection of 
(/)(A) on the span of 4'{P). By Theorem 3 in 1281, we have 


E 


mA)-z\ 


H 




where ||(^(A) — VL|1^ < c||(^(^) — [(/)(A)]fc||^ + A by our assumption. Our lemma is then proved by Markov’s 
inequality. □ 


B.4 Compute Approximation Subspace 

Lemmaj^ If there is a rank-k (1 + e, A)-approximation subspace in the span of 4>{Y), then 

\\LL^f{A) - f{A)f < (1 + e)2 ||<^(yl) - [f[A%\f + (1 + e)A. 

Proof For our choice of w, T® is an e-subspace embedding matrix for If®. Then their concatenation B is an e-subspace 
embedding for If, the concatenation of If®. Then we can apply the argument in Lemma 5 in ||25]| (also implicit in 
Theorem 1.5 in ifTTl i. Below we give a simplified proof. 

First, let X denote the (1 -l-e, A)-approximation subspace in the span of (j>{Y). Since [Q^(j>{A)]k is the best rank-Zc 
approximation for (j>{A), 

\\Q[Q^cl>{A)]k-f{A)\\^ 

= \\Q[Q^f{A)]k - QQT0(A)f + WQQ^fiA) - 

< ||X - QQ^f{A)f + \\QQ^(I>{A) - 4>{A)\\^ 

= \\X-f{A)\\\ 
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Therefore, 


\\Q[Q^<j^{A% - <t>{A)\^^ < MA) - X\\l, < (1 + e) U{A) - [ct>{A)]k\\n + (D 

Second, we apply Theorem 7 in im on Q^(j>{A). Note that that theorem is stated for a specific subspace embed¬ 
ding scheme but it holds for any subspace embedding. Then we have 

\\WW^Q^(l>iA)-Q^cj,{A)\\l^ < {l + e)\\[Q^cl,{A)]k-Q^cj){A)\\l^. ( 2 ) 

We now bound the error using the above two claims. By Pythagorean Theorem, 

||LLT^(A) - cj){A)\\l^ = WlL^HA) - + \\(l>iA) - . (3) 

Noting L = QW, the first term on the RHS is 

\\LL^^(A) - QQ^(l}{A)\\l^ = \\WW^Q^(j)iA)-Q^^{A)\\l^ 

= {l + e)\\Q[Q^4>{A)]k-QQ^<l^{A)\^u^ 

where the inequality is by Q and the equalities are because multiplying Q on the vectors in the span of Q does not 
change their norms. Plugging into ([^, 

< (1 + e) \\Q[Q^4>{A)]k - QQ^ <t>{A)f^ + " QQ^ <t>{A)f^ ) 

— A e)'^Q[Q^(t){A)\k — 4>{A)'^^ 

< il + ermA)-[^iAMl + il + e)A, 

where the second inequality is by Pythagorean Theorem and the last is by ([T]). □ 
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